This is an updated version of a post we originally published in 2020. You can read the original version here.
The growth of the data infrastructure industry has continued unabated since we published a set of reference architectures in late 2020. Nearly all key industry metrics hit record highs during the past year, and new product categories appeared faster than most data teams could reasonably track. Even the benchmark wars and billboard battles returned.
To help data teams stay on top of the changes happening in the industry, we’re publishing an updated set of data infrastructure architectures in this post. They show the current best-in-class stack across both analytic and operational systems, as gathered from numerous operators we spoke with over the last year. Each architectural blueprint includes a summary of what’s changed since the prior version.
We’ll also attempt to explain why these changes are taking place. We argue that core data processing systems have remained relatively stable over the past year, while supporting tools and applications have proliferated rapidly. We explore the hypothesis that platforms are beginning to emerge in the data ecosystem, and that this helps explain the particular patterns we’re seeing in the evolution of the data stack.
To compile this work, we relied again on input from dozens of data experts, who are listed at the end of this post. This simply wouldn’t exist without them, so thank you!
Before we get too deep in the details, here are the latest architecture diagrams. These were compiled with the help of leading data practitioners, based on what they run internally and what they recommend for new deployments.
The first view shows a unified overview across all data infrastructure use cases:
The second view zooms in on machine learning, which is a complex and increasingly independent tool chain:
In the rest of this post, we’ll comment on what’s changed since v1 of the data stack and explore the underlying causes.
Despite the frenzy of data infrastructure activity over the past year, it’s surprising to see — in some ways — how little has changed.
In our first post, we drew a distinction between analytic systems that support data-driven decision-making and operational systems that power data-driven products. We then mapped these categories to three patterns, or blueprints, often implemented by leading data teams.
One of the key questions was whether these architectural patterns would converge. A year later, that doesn’t seem to have taken place.
In particular, the analytic and operational ecosystems both continue to thrive. Cloud data warehouses like Snowflake have grown rapidly, focused largely on SQL users and business intelligence use cases. But adoption of other technologies has also accelerated — data lakehouses like Databricks, for instance, are adding customers faster than ever. Many data teams we spoke with confirmed that heterogeneity in the data stack is likely here to stay.
Other core data systems — namely, ingestion and transformation — have proven similarly durable. This is especially visible in the modern business intelligence pattern, where the combination of Fivetran and dbt (or similar technologies) has become nearly ubiquitous. But it’s also true to an extent in operational systems, where de facto standards like Databricks/Spark, Confluent/Kafka, and Astronomer/Airflow have emerged.
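To make the pattern concrete, here is a minimal sketch of how these pieces are commonly wired together, assuming Airflow 2.x for orchestration and dbt installed on the worker; the DAG name, schedule, and project path are hypothetical, and an ingestion tool such as Fivetran is assumed to land raw tables in the warehouse on its own schedule.

```python
# Minimal Airflow 2.x DAG sketching the common ingest-then-transform pattern:
# an ingestion tool (e.g., Fivetran) lands raw tables in the warehouse, and dbt
# models are rebuilt and tested afterwards. Names and paths are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_warehouse_transforms",  # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Rebuild dbt models on top of the raw tables the ingestion tool has loaded.
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/analytics/dbt",  # hypothetical path
    )

    # Run dbt tests so bad data is caught before dashboards refresh.
    test_models = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/analytics/dbt",
    )

    run_models >> test_models
```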
Around the stable core, the data stack has evolved rapidly over the past year. Broadly speaking, we’ve seen the most activity in two areas:
We’re also seeing the introduction of some new technologies designed to enhance core data-processing systems. Notably, there has been active debate around the metrics layer in the analytical ecosystem and the lakehouse pattern for operational systems — both of which are converging toward useful definitions and architectures.
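As a rough illustration of what the metrics layer is trying to standardize, the toy sketch below defines a metric once, centrally, and compiles it to SQL on request so every downstream tool computes it the same way. This is not any particular vendor’s API; all table, column, and metric names are hypothetical.

```python
# Toy sketch of the metrics-layer idea: a metric is declared once and compiled
# to SQL on demand, so BI tools, notebooks, and applications all agree on its
# definition. Purely illustrative; names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    table: str
    expression: str                       # SQL aggregation defining the metric
    dimensions: list = field(default_factory=list)

    def to_sql(self, group_by: list) -> str:
        # Only allow grouping by dimensions declared for this metric.
        invalid = set(group_by) - set(self.dimensions)
        if invalid:
            raise ValueError(f"unknown dimensions: {invalid}")
        cols = ", ".join(group_by)
        return (
            f"SELECT {cols}, {self.expression} AS {self.name}\n"
            f"FROM {self.table}\n"
            f"GROUP BY {cols}"
        )


revenue = Metric(
    name="monthly_revenue",
    table="analytics.orders",
    expression="SUM(amount)",
    dimensions=["order_month", "region"],
)

print(revenue.to_sql(group_by=["order_month"]))
```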
With that context, we’ll go into detail on each of the major data infrastructure blueprints. Each section below shows an updated diagram (diff’d against v1 of the stack) and an analysis of key changes. These sections are intended primarily as reference for data teams implementing these stacks, and reading them isn’t necessary to follow the rest of the post.
Cloud-native business intelligence for companies of all sizes
What hasn’t changed:
What’s new:
Evolved data lakes supporting both analytic and operational use cases – also known as modern infrastructure for Hadoop refugees
What hasn’t changed:
What’s new:
Stack for robust development, testing, and operation of machine learning models
What hasn’t changed:
What’s new:
To recap: over the past year, the data infrastructure stack has seen substantial stability in core systems and rapid proliferation of supporting tools and applications. To help explain why this might be happening, we introduce the idea of data platforms.
The word “platform” is overloaded in the data ecosystem, often used by internal teams to describe their whole tech stacks or by vendors to sell loosely connected product suites.
In software more broadly, a platform is something other developers can build on top of. Platforms generally provide limited value on their own — most users have no interest, for instance, in accessing the guts of Windows or iOS. But they provide an array of benefits, like a common programming interface and a large installed base, that allow developers to build and distribute the applications users ultimately care about.
The defining trait of a platform, from an industry standpoint, is mutual dependence — both technically and economically — between an influential platform provider and a large pool of 3rd-party developers.
Historically, the data stack has not been an obvious fit for the definition of a platform. Mutual dependence existed — among ETL, data warehouse, and reporting vendors, for instance — but the integration model tended to be one-to-one, rather than one-to-many, and was supplemented heavily by professional services.
According to a number of data experts we spoke with, this may be starting to change.
The platform hypothesis argues that the “backend” of the data stack — roughly defined as data ingestion, storage, processing, and transformation — has started to consolidate around a relatively small set of cloud-based vendors. As a result, customer data is being collected in a standard set of systems, and vendors are investing heavily to make this data easily accessible to other developers — as a fundamental design principle in systems like Databricks, and via SQL standards plus custom compute APIs in systems like Snowflake.
“Frontend” developers, in turn, have taken advantage of this single point of integration to build out a range of new applications. They rely on clean, joined data in the data warehouse/lakehouse, without worrying about the underlying details of how it got there. A single customer may buy and build many applications on top of one core data system. We’re even starting to see traditional enterprise systems, like financial or product analytics, being rebuilt with a “warehouse-native” architecture.
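For a sense of what this looks like in code, here is a minimal sketch of a warehouse-native application reading a modeled table, assuming the Snowflake Python connector; the account settings, schema, and table and column names are all hypothetical.

```python
# Minimal sketch of a "warehouse-native" application: it queries clean, modeled
# tables directly in the warehouse (Snowflake here, via its Python connector)
# without knowing how ingestion or transformation produced them.
# Credentials, warehouse, database, and table/column names are hypothetical.
import os

import snowflake.connector


def top_accounts_by_revenue(limit: int = 10):
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="APP_WH",        # hypothetical virtual warehouse
        database="ANALYTICS",      # hypothetical database of modeled tables
        schema="MARTS",
    )
    try:
        cur = conn.cursor()
        # The application only sees the final, joined model built upstream
        # (e.g., by dbt); it never touches raw ingested data.
        cur.execute(
            "SELECT account_name, total_revenue "
            "FROM fct_account_revenue "
            "ORDER BY total_revenue DESC "
            f"LIMIT {int(limit)}"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, revenue in top_accounts_by_revenue():
        print(name, revenue)
```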
The picture might look like this:
To be clear, this doesn’t mean that OLTP databases or other important backend technologies will disappear in the near future. But native integration with OLAP systems may become a critical component of application development. And over time, more and more business logic and application functionality could transition to this model. We may see a whole class of new products built on this data platform.
The data platform hypothesis is still very much open to debate. However, we are seeing an increase in sophisticated vertical SaaS solutions implemented as horizontal layers on top of the data platforms. And so, while early, we’d argue that the changes taking place in the data stack are at least consistent with the idea that platforms are taking hold.
There are many reasons, for example, that companies like Snowflake and Databricks have become stable pieces of the data stack, including great products, capable sales teams, and low-friction deployment models. But there’s also a case to be made that their stickiness is reinforced by platform dynamics — once a customer has built and/or integrated a range of data applications with one of these systems, it often doesn’t make sense to transition off.
A similar argument can be made for the surge of new data infrastructure products in recent years. The typical explanations for this trend have to do with vast troves of data, increasing corporate budgets, and a glut of VC funding. But those things have arguably been true for decades. The reason we’re seeing so many new products appear now may have to do with platforms — namely, that it’s never been easier to get a new data application adopted, and it’s never been more important to properly maintain the platform.
Finally, the platform hypothesis provides some predictive power in terms of competitive dynamics. At scale, platforms can be extremely valuable. Core data systems vendors may be competing aggressively today not just for current budgets, but for a long-term platform position. Eye-popping valuations for data ingestion and transformation companies — or especially heated debates over new categories like the metrics layer or reverse ETL — also make more sense if you believe they are a core part of the emerging data platform.
We’re still in the early stages of defining the analytical and operational data platform, and the pieces of the platform are in flux. As such, it’s probably more useful as an analogy than as a strict definition. But it may be a useful tool to filter signal from noise, and to help develop a sense of why the market is moving the way it is. Data teams now likely have more tools, resources, and organizational momentum behind them than at any point since the invention of the database. And we’re very excited to see how the app layer evolves on top of the emerging platforms.
List of contributors to Emerging Data Architectures (all versions): Peter Bailis, Mike del Balso, Max Beauchemin, Scott Clark, Jamie Davidson, George Fraser, Krishna Gade, Ali Ghodsi, Abe Gong, Nick Handel, Tristan Handy, Shinji Kim, Mars Lan, Xiangrui Meng, Clemens Mewald, Bob Muglia, Jad Naous, Robert Nishihara, Diego Oppenheimer, Amit Prakash, Ori Rafael, Praveen Rangnath, Nick Schrock, Benn Stancil, Carl Steinbach, Ion Stoica, Kevin Stumpf, Arsalan Tavakoli, Venkat Venkataramani, Don Vu, Reynold Xin, FJ Yang, Matei Zaharia.
Matt Bornstein is a partner at Andreessen Horowitz focused on AI, data systems, and infrastructure.
Jennifer Li is a General Partner at Andreessen Horowitz, where she focuses on enterprise and infrastructure investments in data systems, developer tools, and AI.
Martin Casado is a general partner at Andreessen Horowitz, where he leads the firm's $1.25 billion infrastructure practice.